AMRflows Metagenomic Data Analysis Course

7. Sequencing filetypes

FASTQ

Raw sequencing reads are typically provided in the FASTQ format. This standardised format stores each read as four lines of information. Here’s what a single read looks like in FASTQ format:

@SEQ_ID
CTGAGTTTTTAACTAAATAGTCTAAAAATGGTTTTCGTTT
+
!(((**+%%''.1--))++(%)ABFFFFF<<99GG00//.

Line 1 begins with the @ prefix and contains the name/title/description of the sequence. Typically this is the read ID generated by the sequencing device and the name of your sample.
Line 2 is the raw sequence data, i.e. the individual nucleotides that have been basecalled.
Line 3 begins with the + prefix. The rest of the line is normally empty in modern sequencing data, but it may repeat the information from line 1.
Line 4 contains the ASCII-encoded quality score for each basecalled nucleotide in line 2.
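Because line 4 holds one quality character per nucleotide in line 2, the two lines should always be the same length. The following is a minimal sketch of that check, assuming every read occupies exactly four lines. The file name check_example.fastq and its contents are made up for illustration, and the second read has a deliberately truncated quality line.

```shell
# Build a tiny two-read FASTQ file (hypothetical name and contents).
# The quality line of read2 is one character too short on purpose.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIII\n' > check_example.fastq

# Remember the sequence length on line 2 of each record, then compare it
# with the quality length on line 4 and count the records that disagree.
mismatches=$(awk 'NR % 4 == 2 { seq_len = length($0) }
                  NR % 4 == 0 && length($0) != seq_len { bad++ }
                  END { print bad + 0 }' check_example.fastq)

echo "Reads with mismatched sequence/quality lengths: $mismatches"
```

A count of zero is what you would expect from an intact FASTQ file.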

A FASTQ file contains all the reads from a sequenced sample. If we had 10,000 reads we would have 4 x 10,000 = 40,000 lines of text. To find out how many reads are in a file, we can count its lines with the word count command (wc -l) and divide the result by 4:

wc -l reads.fastq
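The division by 4 can also be left to the shell. This is a small sketch rather than the only way to do it; count_example.fastq is a hypothetical two-read file created so the commands can be run as written.

```shell
# Create a tiny two-read FASTQ file (hypothetical name and contents).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > count_example.fastq

# Redirecting the file into wc (<) makes it print only the number,
# without the file name, so the result can be used in arithmetic.
n_lines=$(wc -l < count_example.fastq)
n_reads=$(( n_lines / 4 ))

echo "Lines: $n_lines, reads: $n_reads"
```

Note that this assumes four lines per read, as described above.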

Sequencing Quality

In Illumina sequencing, each base in a read is assigned a Phred score, which is included in the FASTQ file alongside the sequence data. A Phred score is a numerical value that quantifies the probability of an incorrect base call at a specific position in a DNA sequence. It is expressed on a logarithmic scale and is sometimes known as a Q-score. A higher Phred score indicates a higher level of confidence in the accuracy of a base call. For example, a Phred score of 30 means that there is a 1 in 1,000 chance (0.1%) of the base call being incorrect. A Phred score of 20 corresponds to a 1 in 100 chance (1%) of a base call being wrong. Based on the quality scores, sequences can be trimmed or filtered to improve the overall quality of the dataset.
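The logarithmic relationship behind these numbers is Q = -10 x log10(P), or equivalently P = 10^(-Q/10), where P is the probability that the base call is wrong. The sketch below reproduces the 1 in 100 and 1 in 1,000 figures with awk; the function name phred_to_prob is made up for this example.

```shell
# Convert a Phred score Q into an error probability P = 10^(-Q/10).
# phred_to_prob is a hypothetical helper name used only for illustration.
phred_to_prob() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Print the error probability for a few common Phred scores.
for q in 10 20 30 40; do
  printf 'Q%s\t%s\n' "$q" "$(phred_to_prob "$q")"
done
```

Every increase of 10 in the Phred score makes an incorrect base call ten times less likely.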

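Illumina sequencing typically produces sequences with an average Phred score of 30+. Quality scores are encoded as single ASCII characters because the integers themselves vary in length (e.g. 9 versus 30), whereas one character per base keeps the quality line the same length as the sequence line. We don’t need to worry about interpreting these by hand, as programs do this automatically.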
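For the curious, the decoding that programs perform can be sketched in a few lines. This assumes the Phred+33 encoding used by current Illumina instruments, where the score is the character's ASCII code minus 33 (so ! is 0). Older data may use a different offset. The function name decode_quality and the example string are made up for illustration.

```shell
# Decode a quality string into Phred scores, assuming a Phred+33 offset.
# decode_quality is a hypothetical helper name used only for illustration.
decode_quality() {
  # od prints each byte as an unsigned decimal ASCII code;
  # awk then subtracts the offset of 33 from every code.
  printf '%s' "$1" | od -An -v -tu1 |
    awk '{ for (i = 1; i <= NF; i++) print $i - 33 }'
}

# One score per line for a short, made-up quality string.
decode_quality '!+5?I'
```

Each character in the quality line maps to exactly one score, which is why the quality line is always as long as the sequence line.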

Compression

Sequencing data is often “paired-end”, meaning there will be two FASTQ files: one containing the forward reads and one containing the reverse reads, which are sequenced from opposite ends of the same DNA fragments. These files are typically distinguished by the suffixes _R1 and _R2, or simply _1 and _2. FASTQ files contain a lot of information and can therefore be very large. This can be mitigated through compression.
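Since each read in one file has a partner in the other, a pair of files would normally hold the same number of reads. A quick sanity check could look like the sketch below. The file names sample_R1.fastq and sample_R2.fastq and their contents are hypothetical, and four lines per read are assumed.

```shell
# Create a tiny pair of FASTQ files (hypothetical names and contents).
printf '@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\nIIII\n' > sample_R1.fastq
printf '@read1/2\nCCGT\n+\nIIII\n@read2/2\nGGGA\n+\nIIII\n' > sample_R2.fastq

# Count the reads in each file (number of lines divided by 4).
r1_reads=$(( $(wc -l < sample_R1.fastq) / 4 ))
r2_reads=$(( $(wc -l < sample_R2.fastq) / 4 ))

# Compare the two counts and report the result.
if [ "$r1_reads" -eq "$r2_reads" ]; then
  pair_status="read counts agree"
else
  pair_status="read counts differ"
fi

echo "$pair_status: R1=$r1_reads R2=$r2_reads"
```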

The most common method for compressing FASTQ files is gzip. FASTQ files you receive from the sequencer or download are often already compressed this way, which is indicated by the .gz extension. You can compress your own files (as long as they haven’t been compressed already) with:

gzip reads.fastq

This will replace the reads.fastq file with reads.fastq.gz in your working directory. To decompress the gzipped file, we simply run:

gunzip reads.fastq.gz

Unzipping is often unnecessary as most programs are built to handle gzipped files, so it’s best to leave them compressed to save space!
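You can also inspect a gzipped file yourself without unpacking it on disk, by streaming the decompressed text into another command. The sketch below uses a made-up file, stream_example.fastq, and the standard gzip options -c (write to standard output) and -d (decompress).

```shell
# Create and compress a tiny two-read FASTQ file (hypothetical name/contents).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > stream_example.fastq
gzip -f stream_example.fastq   # leaves only stream_example.fastq.gz

# Look at the first read; the .gz file itself is left untouched.
gzip -cd stream_example.fastq.gz | head -n 4

# Count the reads straight from the compressed file.
gz_reads=$(( $(gzip -cd stream_example.fastq.gz | wc -l) / 4 ))
echo "Reads in compressed file: $gz_reads"
```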

FASTA

FASTA files are similar to FASTQ files, but each record only contains two lines instead of four:

>ID
ACTGATTGACTAGCAGTTTTGGACAGA

Line 1 always contains the “>” prefix, followed by the title or description of the sequence beneath.
Line 2 contains the sequence data.

FASTA differs from FASTQ in that no quality information is included. FASTA files are typically used for assemblies (which we will cover later), with each contig introduced by its own > header line.
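Because every contig starts with a > header line, counting those lines gives the number of contigs in an assembly. A minimal sketch, using a made-up file called contigs_example.fasta:

```shell
# Create a tiny two-contig FASTA file (hypothetical name and contents).
printf '>contig_1\nACTGATTGAC\n>contig_2\nTTTTGGACAGA\n' > contigs_example.fasta

# Count the lines that start with ">", i.e. one per sequence.
# The quotes matter: an unquoted > would be treated as a redirect.
n_contigs=$(grep -c '^>' contigs_example.fasta)

echo "Contigs: $n_contigs"
```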

Sometimes reads are converted to FASTA format for certain tools, but this is uncommon. The general rule is that FASTQ files are for reads and FASTA files are for assemblies.
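If you ever do need such a conversion, it amounts to keeping lines 1 and 2 of each read and swapping the @ prefix for >. The awk sketch below is one simple way to do this, assuming four lines per read. Dedicated tools exist for the job, and the file names here are hypothetical.

```shell
# Create a tiny two-read FASTQ file (hypothetical name and contents).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > convert_example.fastq

# Line 1 of each record: replace the leading "@" with ">".
# Line 2 of each record: print the sequence unchanged.
# Lines 3 and 4 (the "+" line and the qualities) are dropped.
awk 'NR % 4 == 1 { print ">" substr($0, 2) }
     NR % 4 == 2 { print }' convert_example.fastq > convert_example.fasta

cat convert_example.fasta
```

The quality information is lost in this conversion, so keep the original FASTQ file.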